ABSTRACT


1. Introduction 


Fatality figures recently released by the World Health Organization show an alarming number of traffic accidents worldwide every year. About 1.2 million people are killed and 50 million are injured in automobile accidents annually, which amounts to roughly 3,300 fatalities and 137,000 injuries every day. Frequent traffic accidents cause 43 billion dollars in direct economic damages and pose a direct threat to human life and property.


One of the most crucial areas of traffic safety research is the prediction of road accidents. Road geometry, traffic flow, driver traits, and road environment all have a major impact on the likelihood of traffic accidents. Many studies have been carried out to forecast accident frequencies and evaluate the factors behind traffic accidents, including studies on the identification of dangerous locations (hot spots), the analysis of accident injury severities, and the study of accident duration. Several studies concentrate on the mechanism of accidents. Weather and road visibility are further concerns.


India is experiencing a very concerning increase in accidents. Recent data indicate that India accounts for 6% of all traffic accidents worldwide. Irresponsible riding of two-wheelers is a major cause of recorded accidents, and over-speeding is another contributing factor. Accidents caused by drunk driving and other traffic offences are frequent as well. Despite established rules and traffic codes, many accidents are caused by people being careless about the speed and condition of their vehicles and by their failure to wear helmets. Although the growing number of vehicles is the primary contributor to traffic accidents, the importance of road conditions and other environmental factors should not be underestimated. The number of fatalities from traffic accidents in India is undoubtedly concerning. With more than 137,000 persons suffering injuries as a result of traffic accidents, the situation is exceedingly grim. This number is more than four times the number of people killed by terrorism each year. Accidents involving large trucks, and even those involving buses used for public transit, are among the deadliest kinds of accidents and cost the lives of innocent people.


Rain, fog, and other weather conditions can greatly increase the likelihood of accidents. It is therefore easier to act to reduce accidents when a proper estimate of incidents is available and accident hotspots and contributing variables are known. This requires a careful examination of past events and the creation of models for predicting accidents. In this paper, we discuss how an accident prediction model can help determine the risks associated with various accident scenarios.


Lee et al. created a probabilistic model linking important crash precursors to changes in crash potential. Abdel [10] created an earlier crash prediction model using the matched case-control logistic regression technique. The traffic police, however, have no dedicated method for predicting which location is accident-prone at a given time. Predicting traffic accidents is vital for effective traffic planning and management, yet it is difficult because accidents involve nonlinear and highly random factors such as people, cars, roads, and weather. Because the data are noisy and limited in quantity, typical linear analyses are unable to reflect the true situation, and the resulting predictions are inadequate. Traditional back-propagation (BP) networks suffer from flaws including local minima, excessive iterations, and slow training.
2. Related Works


Several researchers have employed deep learning for image classification [1], text mining, fake news identification, and text classification, among other applications of data mining. Many scholars have considered employing various data mining techniques to analyse traffic accident data, and several studies have examined the severity of traffic accidents in various nations.


In one study, parameter selection techniques were used to choose, out of 150 parameters in total, the 16 that have the greatest impact on the severity of drivers' injuries. An ANN was used to classify the degree of injury severity, and it obtained a rather poor accuracy of 40.71%. Regression models, which link explanatory variables to a response variable, are fundamental to this type of data analysis. In a regression-based study, the model's sensitivity and specificity were found to be 40% and 98%, respectively, at a probability cut-point of 0.20. The findings of that study also demonstrate the significance of velocity, seat belt use, and impact direction in determining how serious an accident will be. Fuzzy rule mining has also been suggested for analysing accidents.


In another analysis, three classes were forecast using a collection of binary prediction models, which helped raise the model's accuracy to 60.94%; the crucial factors influencing accident severity were found to be incorrect overtaking and seat belt neglect. Sharma et al. analysed road accidents using a Support Vector Machine and a multi-layer perceptron. They used only a few data samples for their experiments and reported that the accidents were caused by high-speed driving while intoxicated. For classification and clustering, Tiwari et al. [11] employed machine learning models such as Decision Tree (DT), Naive Bayes (NB), and Support Vector Machine (SVM). They obtained better results with the clustered dataset.


The ability to predict the severity of road incidents is still being developed. Prediction accuracy will increase with the adoption of a suitable method. Choosing the most suitable model also aids in determining the causes of traffic accidents. Moreover, features that are more pertinent to the target can help machine learning models produce better predictions for previously unseen cases.
3. Proposed System
A. Proposed Road Accident Prediction System 


We propose a machine learning approach that gauges accident severity from the circumstances of an accident. It has been trained with 1.6 million accident records from 2005 to 2015, and accuracy improves as more data become available. The goal of such a model is to predict which circumstances are more likely to result in accidents, so that potential incidents can be identified more accurately and care and preventative measures can be delivered more quickly. The approaches applied to the provided dataset are discussed in this section. Classification algorithms, namely Logistic Regression, Decision Tree, and Random Forest, are used to classify the dataset for the prediction of traffic accidents. We tested these three algorithms to predict accident severity. In terms of predicting all the classes of accident severity, Decision Tree and Random Forest clearly outperformed Logistic Regression. Although logistic regression reaches a high accuracy, this does not necessarily imply that it is the best algorithm. In the hyperparameter tuning step, we also tried a multinomial setting so that it would predict all the classes, but it still predicted only the most frequent class. In summary, Logistic Regression has an accuracy of 86.23%, Decision Tree 75.26%, and Random Forest 86.86%, so Random Forest produces the best results. Several pre-processing methods are also employed in this system.


The suggested system's block diagram is shown in Figure 1.

Figure 1. Block diagram of the proposed system (blocks: Road Accident Dataset, Preprocessing, Feature Importance, Data Splitting, Random Forest, Logistic Regression, Trained Model, Model Evaluation, Accident Severity Prediction)


B. Data Importing 


The data are divided into three files: accidents, casualties, and vehicles, and all three are imported for analysis. One additional file contains general traffic counts from 2000 to 2015; this data on general traffic patterns can be used in the machine learning portion.

• The necessary package imports are completed.
• Three CSV files were used: Accidents.csv, Casualties.csv, and Vehicles.csv.
• The data are imported into data frames using pandas.
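The import step listed above could be sketched roughly as follows. This is a minimal illustration rather than the code used in the study: the helper name load_accident_tables and the idea that the three files sit in one folder are assumptions made here.

```python
from pathlib import Path

import pandas as pd


def load_accident_tables(data_dir):
    """Read Accidents.csv, Casualties.csv and Vehicles.csv into pandas data frames.

    data_dir is a hypothetical folder that holds the three files.
    """
    data_dir = Path(data_dir)
    names = ["Accidents", "Casualties", "Vehicles"]
    # low_memory=False lets pandas infer each column's type from the whole file.
    return {
        name: pd.read_csv(data_dir / f"{name}.csv", low_memory=False)
        for name in names
    }
```

Calling load_accident_tables on the folder that contains the files would return a dictionary with one data frame per file, keyed by "Accidents", "Casualties", and "Vehicles".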

C. Applied Machine Learning Techniques 

(i) Logistic Regression 


Logistic regression is a supervised classification technique that predicts an outcome with two alternative values, such as zero or one, no or yes, false or true. It gives the probability of a binary dependent variable as a function of the dataset's independent variables. Despite the obvious similarities between logistic regression and linear regression, logistic regression produces an S-shaped curve rather than a straight line. Using one or more independent variables or predictors, logistic regression fits logistic curves whose values lie between 0 and 1. In short, logistic regression is a regression model used to analyse the relationship between a number of independent variables and a categorical dependent variable.


$$P = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$$
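To make the logistic curve concrete, the sketch below evaluates the logistic probability for a single predictor. The names alpha (intercept), beta (slope), and logistic_probability are chosen here for illustration and are not taken from the study.

```python
import math


def logistic_probability(x, alpha, beta):
    """Return P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    # Both branches compute the same value; splitting on the sign of z
    # keeps math.exp from overflowing when |z| is large.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output always lies between 0 and 1 and rises with x when beta > 0.
for x in (-5, 0, 5):
    print(x, logistic_probability(x, alpha=0.0, beta=1.0))
```

With alpha = 0 and beta = 1, the probability is exactly 0.5 at x = 0, which is the midpoint of the S-shaped curve.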


(ii) Decision Tree


A decision tree method uses conditional control statements to predict the final decision by building a tree-like graph or model of options and their possible outcomes. A decision tree is a tool for approximating discrete-valued target functions, in which the learned function is represented as a tree. These algorithms are well known for inductive learning and have been successfully applied to a number of tasks. Each new record is evaluated against the decision tree, following a path from the root node to a leaf that gives the record's output or class label. This process is repeated for each new record to determine its class, which here is the accident severity level.


$$\text{Entropy}(S) = \sum_{i} -p_i \log_2 p_i$$

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$
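The entropy and information gain used to choose decision tree splits can be computed as in the sketch below. It assumes base-2 logarithms, and the toy severity and weather values are made up for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S): sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, attribute_values):
    """Gain(S, A): entropy of S minus the size-weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data (made up): severity labels and one weather attribute.
severity = ["slight", "slight", "serious", "serious"]
weather = ["dry", "dry", "fog", "fog"]
print(entropy(severity))
print(information_gain(severity, weather))
```

In this toy case the attribute separates the two classes perfectly, so the gain equals the full entropy of the label set (1 bit). An attribute unrelated to the labels would give a gain near zero.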
(iii) Random Forest


Random Forest is a technique for classification and regression. It is, in essence, a collection of decision tree classifiers. Random forest has an advantage over single decision trees because it corrects their tendency to overfit the training set. While each decision tree is being built, every node is split using a feature drawn from a random subset of the full feature set, and each individual tree is trained on a subset of the training set. Even for large data sets with many attributes and data instances, training a random forest is remarkably quick because each tree is trained independently of the others. The Random Forest approach has been found to be resistant to overfitting and to provide a good approximation of the generalisation error.
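A minimal scikit-learn sketch of the ideas above is given here. It uses synthetic three-class data as a stand-in for the accident records, and the parameter values are illustrative assumptions, not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the accident records (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees in the ensemble
    max_features="sqrt",  # size of the random feature subset tried at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample of the training set
    oob_score=True,       # out-of-bag estimate of the generalisation accuracy
    random_state=0,
)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print("out-of-bag accuracy:", forest.oob_score_)
print("test accuracy:", test_accuracy)
```

The out-of-bag score is computed from the training samples each tree did not see, which is one way a random forest approximates its generalisation error without a separate validation set.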


(iv) Hyper Parameter Tuning
The predictive performance of ML algorithms can be significantly affected by hyperparameter (HP) tuning [9]. An ML algorithm's HPs are often configured through trial and error. Finding a good set of values manually can take a lot of time, depending on how long the ML algorithm needs to train. Because of this, current research on HPs for ML algorithms has focused on improving HP tuning strategies [5].

HP tuning is typically viewed as a black-box optimization problem in which the objective function is the predictive accuracy of the resulting model.
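One common way to automate this search is an exhaustive grid search with cross-validation. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the grid values and the choice of a decision tree are assumptions for illustration and are not the tuning setup reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (assumption), small enough to search quickly.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Candidate hyperparameter values; every combination is evaluated.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # the objective: predictive accuracy of the fitted model
    cv=3,                # 3-fold cross-validation for each combination
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Grid search treats the model as a black box: it only needs the cross-validated accuracy for each combination. It becomes expensive as the grid grows, which is why randomised or model-based search strategies are often preferred for larger search spaces.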
D. Performance Evaluation Measures 
(i) Accuracy 


Model evaluation is a crucial task in classification, and several measures are used for it. The evaluation criteria used in this study are the most widely used ones: accuracy, precision, recall, and F-score. Accuracy is formally defined as the proportion of predictions that are correct and is computed as follows:


$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$


Accuracy can be calculated in terms of positives and negatives in binary classification: 


$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$


(ii) Precision 


Precision refers to the exactness of a classifier: it indicates what proportion of the tuples labelled as positive are genuinely positive. It is determined by:


$$\text{Precision} = \frac{TP}{TP + FP}$$


(iii) F-Measure/F1-Score


The F-score is a classification metric that yields a score between 0 and 1 while taking both the classifier's recall and precision into account. It is calculated as follows:


$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$


(iv) Recall 


Recall, on the other hand, is frequently referred to as the measure of completeness: it gives the proportion of actual positive tuples that are correctly classified. It is determined by:


$$\text{Recall} = \frac{TP}{TP + FN}$$
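The four measures can be computed directly from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below assumes the binary case; the function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a multi-class problem such as accident severity, these quantities are usually computed per class, treating that class as positive and the rest as negative, and then averaged.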


Middle East Journal of Applied Science & Technology (MEJAST), Volume 6, Issue 2, Pages 64-75, April-June 2023. ISSN: 2582-0974
E. Data Pre-processing 

(i) Data Cleaning 

Here, we identify and remove noisy and irrelevant data. Visualization also helps us determine which features are more important.

(ii) Data Visualization 

The first thing we can do is examine the date and time of each collision, as well as the age of the drivers involved. The frequency of accidents can be determined by day of the week and by hour of the day, and the driver's age can provide additional information.
(a) Accidents on Day of Week 


The number of accidents can be broken down by day of the week. From 2005 to 2015, Thursday had the largest number of accidents in this dataset. Note that the number of accidents may vary with the volume of traffic on a given day.


Figure 2. Accidents on the day of a Week (accident count against day of week, coded 0 = Sunday, 1 = Monday, 2 = Tuesday, 3 = Wednesday, 4 = Thursday, 5 = Friday, 6 = Saturday)
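A count of accidents per weekday can be obtained with pandas as sketched below. The column name "Date", its day/month/year format, and the four example rows are assumptions made for illustration. Note that pandas numbers Monday as 0, so the values are shifted to match a coding in which 0 is Sunday and 6 is Saturday.

```python
import pandas as pd

# Made-up rows; the "Date" column name and format are assumptions.
accidents = pd.DataFrame(
    {"Date": ["01/01/2015", "02/01/2015", "08/01/2015", "04/01/2015"]}
)

dates = pd.to_datetime(accidents["Date"], format="%d/%m/%Y")
# pandas uses Monday=0 ... Sunday=6; shift so that Sunday=0 ... Saturday=6.
accidents["day_of_week"] = (dates.dt.dayofweek + 1) % 7

# Count accidents per weekday and keep days with no accidents as zero.
counts = accidents["day_of_week"].value_counts().reindex(range(7), fill_value=0)
print(counts)
```

Plotting counts as a bar chart would give a figure of the same kind as the weekday chart described in the text.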
(b) Time of Accident 


Here, we found that accidents tended to occur more frequently after midday. We presume that traffic is heaviest at this time of day because people are likely leaving work.

Figure 3. Time of Accident

(c) Age Band of Casualties


This finding about the dataset is quite intriguing. Most of the drivers involved in accidents are between the ages of 25 and 35. However, we do not know what proportion of all drivers falls in this age range compared with other ages. We would expect drivers between the ages of 25 and 35 to make up a higher proportion of all drivers.

Figure 4. Age of People involved in accident (accident count against age of drivers)


(d) Correlation between variables
Our dataset contains only numeric values, so we can determine whether any two columns are correlated. Few variables have substantial relationships with one another; the only substantial positive correlation is between Speed limit and Urban or Rural Area.

Figure 5. Correlation
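A pairwise correlation matrix of numeric columns can be computed with pandas as in the sketch below. The column names echo those mentioned in the text, but both the names and the values are made up for illustration.

```python
import pandas as pd

# Toy values (made up); column names are assumptions.
df = pd.DataFrame(
    {
        "Speed_limit": [30, 30, 40, 60, 70, 30],
        "Urban_or_Rural_Area": [1, 1, 1, 2, 2, 1],
        "Number_of_Vehicles": [2, 1, 2, 1, 3, 2],
    }
)

# DataFrame.corr computes pairwise Pearson correlation by default.
corr = df.corr()
print(corr.round(2))
```

Each entry lies between -1 and 1, and the diagonal is 1 because every column is perfectly correlated with itself. A heat map of this matrix is a common way to spot the few strongly related pairs.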


(e) Speed of Cars 


The majority of collisions happened on roads with speed limits of 30 mph or higher. Accidents may occur as a result of stop signs, lane changes, turning into parking lots, and so on.

Figure 6. Accident Percentage (panel title: Accidents percentage in Speed Zone)
(f) Normalize the Data 


We will standardise a small number of columns so that our machine learning algorithms are not adversely affected by differences in scale. In the dataset, the age of the drivers ranges from 18 to 88, so we standardise it. The age of the car ranges from 0 to 100, which can also affect how well a machine learning algorithm performs, so we normalise this predictor as well.
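The scaling step can be sketched as follows, assuming that z-score standardisation is what is meant (min-max scaling would be an alternative). The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names are assumptions.
df = pd.DataFrame(
    {"Age_of_Driver": [18, 35, 53, 88], "Age_of_Vehicle": [0, 5, 20, 100]}
)


def standardise(column):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    return (column - column.mean()) / column.std(ddof=0)


scaled = df.apply(standardise)
print(scaled.round(3))
```

After this step each column has mean 0 and standard deviation 1, so a column with a wide range no longer dominates scale-sensitive models such as logistic regression. In practice the mean and standard deviation should be estimated on the training split only and then reused for the test split.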
The experiments were conducted in Python in a Jupyter Notebook environment. Scikit-learn (sklearn) is used to implement the machine learning models. The outcomes of the Logistic Regression, Decision Tree, and Random Forest analyses performed on the traffic accident dataset are shown in this section.


(1) Logistic Regression Result

(3) Random Forest Result

(Accuracy, 85.71)

Classification reports: the per-class support is 4111, 38151, and 264697 samples (306959 samples in total).

(6) Overall Accuracy

S. No. | Algorithm Name | Accuracy Score (%)
1. | Decision Tree | 75.26
2. | Random Forest | 86.86
3. | Logistic Regression | 86.23
4. | Decision Tree Hyperparameter Tuning | 85.74
5. | Logistic Regression Hyperparameter Tuning | 86.23


5. Conclusion


Traffic accidents are a primary cause of injuries, fatalities, and property damage, and they have grown into a serious problem for public health and safety. Accidents also contribute to traffic congestion and delays. To increase the effectiveness of the transportation system, accidents must be managed by examining the factors connected with them. In this study, three machine learning models, Logistic Regression, Random Forest, and Decision Tree, are used to predict the severity level of traffic accidents. The experimental findings show that the Random Forest classification results outperformed those of Logistic Regression and Decision Tree. Significant features are identified using Random Forest and Decision Tree; the top features include distance, temperature, wind chill, humidity, visibility, and wind direction. Given that Random Forest outperformed the other models in predicting accident severity, it can be regarded as the most effective of the models tested. The main goal of distinguishing significant features from general features was to measure the link between significant features and traffic accidents. As future work, continuous predictions and alerts could be sent to the police for every location at regular intervals so that preventive measures can be taken. The system could be integrated with Google Maps for live tracking by the police, and a fully fledged web app for user and police interaction could be published for real-time use. It could also be applied to Indian states or cities if proper accident data is provided by the Indian Government.